1. Introduction


In February 2012, a dedicated high performance computing project investment (DHPI) was
installed at the U.S. Army Research Laboratory (ARL) facilities on the Aberdeen Proving
Ground (APG) in MD. This report documents the installation, acceptance testing, vendor
interactions, and final system acceptance for production usage. The cluster was unique within the
Department of Defense (DOD) at the time due to its use of a low latency 10-gigabit Ethernet
(GigE) network fabric, many-core (12) AMD Opteron central processing units (CPUs), Nvidia
Fermi graphics processing units (GPUs), and a mixture of CPU-only and CPU/GPU nodes. The
combined government/contractor team worked for more than 7 months to complete acceptance
of the system and provide feedback to the vendor for system failures and repairs. The final
accepted production system is stable and capable of nearly 1 PetaFLOPS (10^15 floating-point
operations per second) of single precision performance.


2. Background


The Thufir system is dedicated to network simulation and emulation with a focus on mobile ad
hoc networks (MANETs) and the Mobile Network Modeling Institute (MNMI). The name Thufir
comes from a Mentat character in the Dune book series. Mentats are humans trained as a
replacement for computerized calculation (7). The MNMI was established in fiscal year 2007 to
exploit high performance computing (HPC) through the development of computational software
that enables the DOD to design, test, and optimize networks at sufficient levels of fidelity and
with sufficient speed to understand the behaviors of network-centric warfare technologies in the
full range of conditions in which they will be deployed. Operational goals include the
development of scalable computational modeling tools for simulations and emulations, the
ability to understand a priori the performance of current and proposed radio waveforms such as
the Wideband Networking Waveform (WNW) in the field, and the optimization of the network
for U.S. Soldiers.

Wireless networks have become a fixture in the modern world, and the shortcomings of such
networks are encountered all too often: dropped calls; variations in data traffic bandwidth as a
function of distance from a wireless access point or from walls and other objects that block line
of sight to an access point; and interference from other radio sources. In everyday life, users
often adapt by relocating to a point where there is a better wireless signal, and providers adapt by
increasing the density of access points in areas with spotty reception.

The bulk of research conducted in academia focuses on wireless cellular networking as
previously described. While aspects of cellular networks carry through to mobile ad hoc
networks, the two are distinct in many ways, with mobile networks posing greater challenges. In
mobile ad hoc networks, access points can move and coverage may vary widely in a region. At
times, the access points will cluster together, leaving parts of the map with sparse coverage and
parts with compromised service due to competition for available bandwidth as channel
subscriptions become saturated. On the battlefield, it may not be an option to relocate receivers
to areas with better coverage, which highlights the need to optimize and plan prior to the mission
as much as possible. Consider the problem of combat in an urban area. Narrow streets and buildings
with metal roofs and reinforced concrete walls may interfere with radio reception and access to
surveillance data, yet that surveillance information may be key to locating enemy combatants
before they inflict casualties on friendly forces.

Mobile networks must be understood at a series of levels, from radio to packet network to
communication infrastructure and its resulting impact on the Warfighter. To make this difficult
problem tractable, the MNMI has developed a four-pronged approach.

1. The first is MANET simulation, where large-scale HPC assets can be used to test and
optimize large radio deployments. These simulations are most often limited to determining
whether radios can “see” each other.

2. Second is MANET emulation, where researchers can investigate performance of proposed
radios from high in the network stack (application layer) all the way to the lower physical layers.

3. Third is MANET experimentation, where live and constructive exercises can capture real
radio performance and test interoperability with real and virtual assets. This also results in
data sets that can be later mined to fill data gaps and verify models.

4. Fourth is a system that ties together all of these aspects and brings in support for
visualization and data analysis. The MNMI and Command, Control, Communications,
Computers, Intelligence, Surveillance and Reconnaissance-Network Modernization are
addressing all of these topics. Thufir is focused primarily on two efforts, namely simulation
and emulation, but is also greatly enhancing all four focus areas of the MNMI.

The scale and complexity of mobile ad hoc networks are unique to the DOD, and to the Army in
particular, as a mobile fighting force. The military is rapidly becoming a network-centric force,
with substantial access to sensor-derived surveillance information as well as an increasingly
complicated application layer running over many different devices. This introduces significant
advantages to the Warfighter but also brings in new dependencies and new risks from the rapid
change in configurations of the MANETs that provide network access across the battlefield.

Unfortunately, it is difficult to design and evaluate mobile ad hoc networks with sufficient
fidelity and scale. The research plans and focus of the MNMI are addressing the primary issues
associated with optimized MANET planning and execution. Thufir provides dedicated HPC
resources to assist MANET-related research and to help researchers identify the principles for
MANET optimization and planning without trial-and-error testing that would significantly increase
Soldier risk, jeopardize mission success, and not map well to future missions.


3. Dedicated High Performance Computing Project Investment: Thufir


MANET emulations on a large scale, up to 5000 emulated devices, require dedicated resources
that allow for the research, development, and evaluation of network algorithms in a controlled
environment. In addition, real-time MANET emulated devices may be inserted into live
experiments to augment testing and the traffic perceived by the physical network devices in the
field. This enhances the experience of testers and increases the degrees of freedom that can be
evaluated in an experiment. To achieve these goals, a number of technical gaps must be bridged,
including the real-time computation of radio frequency (RF) propagation in urban environments
and the high-fidelity software emulation of network devices.

MANET emulations require hardware-in-the-loop that can create a controllable, repeatable
virtual environment for the testing and evaluation of network devices, both real and emulated.
The software emulation of network devices requires low-level access to the Internet Protocol (IP)
transport capable network interfaces connected to the host machine. This low-level access allows
the emulated network stack to process incoming IP packets or Ethernet frames and forward them
through the emulated network stack. Each emulated device requires isolated network access that
is provided using virtualization technology. The virtualization technologies include full hardware
virtualization such as KVM (kernel-based virtual machine), Xen, and VMware or
container-based virtualization such as OpenVZ. Unfortunately, each of these technologies requires a
specialized operating system kernel in order to be used. This presents a problem for utilization of
shared resources where stability and security are the primary concerns and is a motivating factor
to have a dedicated resource such as Thufir.

RF wave propagation models play an essential role in planning, analysis, and optimization of
radio networks (2). For instance, coverage and interference estimates of network configurations
are based on field strength predictions. Approaches for field strength prediction can be divided
into semiempirical and ray-optical models. For example, the semiempirical
COST-Walfisch-Ikegami model (5) estimates the received power predominantly on the basis of
frequency and distance to the transmitter. Ray-optical approaches identify ray paths through the
scene, based on wave guiding effects like reflection and diffraction. Semiempirical algorithms
usually offer fast computation times but suffer from inherent low prediction quality. Ray-optical
algorithms feature a higher prediction quality at the cost of higher computation times. For MANET
emulation integration, these algorithms must be computed in real-time for each of the
propagation paths possible, with an O(n^2) complexity.

GPUs such as those in Thufir have been identified as a solution to provide the raw floating-point
performance required to compute each RF propagation path loss in real-time. GPU computing
provides great promise for improved performance over traditional CPUs for floating-point
intensive applications. The GPU architecture is also ideally suited for accelerating ray-tracing
algorithms that are found in the ray-optical approaches for RF wave propagation modeling. In
addition, these RF propagation computations must be tightly coupled with the MANET
emulation environment to provide real-time (defined to be less than 0.5 s based on routing-table
refresh rates) response to computation requests.
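
To give a feel for what the O(n^2) path count implies for the 0.5-s real-time window, the following is a minimal Python sketch. It assumes that paths are symmetric (so only unordered radio pairs are counted) and that every pair is recomputed once per window; the function names are illustrative and are not part of any MNMI software.

```python
def pairwise_paths(n_radios: int) -> int:
    """Number of unordered radio pairs, n(n-1)/2, which grows as O(n^2)."""
    return n_radios * (n_radios - 1) // 2


def per_path_budget_s(n_radios: int, deadline_s: float = 0.5) -> float:
    """Average time available per path if all pairs are recomputed within the deadline."""
    return deadline_s / pairwise_paths(n_radios)


if __name__ == "__main__":
    # 5000 is the largest emulation size mentioned in this section.
    for n in (100, 1000, 5000):
        print(f"{n:5d} radios: {pairwise_paths(n):>10d} paths, "
              f"{per_path_budget_s(n) * 1e9:10.1f} ns per path")
```

Under these assumptions, 5000 radios yield about 12.5 million paths, which leaves an average of roughly 40 ns per path if the work were done serially. This is the motivation for massively parallel evaluation.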

The environment currently used within the MNMI for MANET emulations requires that all radio
positions be determined a priori because of the costly path loss computations. This means that
real-time adjustments based on force-on-force interactions cannot be used in tandem with
MANET emulations, thus limiting the utility of the emulation environment. The capability to
dynamically alter radio positions and trajectories would add a level of realism to the network
emulations and simulations that is not currently possible. In addition to potentially real-time
propagation modeling, a GPU-accelerated algorithm would be the basis for adding higher fidelity
modeling capabilities such as foliage (4) and weather effects.

In general, current commercial simulation software packages only provide low to medium
fidelity RF propagation models. High-fidelity models, such as those developed in the electronic
battlefield Common High Performance Scalable Software Initiative portfolio (5), are required by
MANET emulations to provide realistic stimulation to live experiments and for the assessment of
emerging technologies. Accurate modeling of RF wave propagation requires path loss
computations due to free space distances, wave reflections, refraction, weather effects, and
absorption. Each of these factors contributes to some degree to the computational requirements
of the propagation model. For instance, to include foliage in the propagation model, absorption
and scattering by branches and leaves must be considered. Forest parameters such as tree types
(deciduous or coniferous) and tree density must also be modeled (4).
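
As a point of reference for the simplest of these terms, the free-space contribution can be sketched in a few lines of Python using the standard free-space path loss formula, FSPL(dB) = 20 log10(d) + 20 log10(f) + 32.44, with d in kilometers and f in megahertz. This is only an illustration of the free-space term; it is not the ITM or ray-tracing code discussed in this report, and the function name is hypothetical.

```python
import math


def free_space_path_loss_db(distance_km: float, frequency_mhz: float) -> float:
    """Free-space path loss in dB for a distance in km and a frequency in MHz."""
    if distance_km <= 0 or frequency_mhz <= 0:
        raise ValueError("distance and frequency must be positive")
    return 20 * math.log10(distance_km) + 20 * math.log10(frequency_mhz) + 32.44


if __name__ == "__main__":
    # Example frequency chosen for illustration only (2.4 GHz).
    for d_km in (0.1, 1.0, 10.0):
        print(f"{d_km:5.1f} km: {free_space_path_loss_db(d_km, 2400.0):6.2f} dB")
```

Every doubling of distance adds about 6 dB of loss in this model. Reflection, refraction, weather, and foliage terms add further computation on top of this baseline for every path.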

Initially, we will be running two software applications with RF propagation models using general
purpose graphics processing units (GPGPUs). Both of these applications were developed as part
of the MNMI. The first is based on the Irregular Terrain Model (ITM) or Longley-Rice model.
The second application is a ray-tracing algorithm using GPUs to compute line-of-sight paths
between devices. Ray-tracing is primarily required for urban environments where the reflections
and refractions from buildings and walls are of primary importance. Both of the RF propagation
models currently execute in parallel on multiple GPUs and were expected to scale well with the
number of GPUs.

We have focused primarily on the MANET emulation software Extendable Mobile Ad-hoc
Network Emulator (EMANE), which is capable of scaling up to thousands of MANET devices with
the resources provided by Thufir. In addition to an 802.11a/b/g wireless model and a
configurable radio model, a high-fidelity Soldier Radio Waveform (SRW) model that requires
increased computational effort beyond the standard civilian waveforms is also available. ARL is
also investigating the development of additional Army radio waveforms such as WNW and
WNaN (Wireless Network after Next) that will have even higher bandwidth requirements than
SRW in addition to a potentially increased computational complexity. Emulating a hybrid
network consisting of multiple Army radio waveforms will require a network interconnect that
can support both high bandwidth (>1 Gb/s) and low latency (<3 µs) because of the
connectedness, O(n^2), of the MANET. This quadratic scaling of traffic also increases the
computational requirements of the EMANE software quadratically.

The amount of real network traffic generated by MANET emulations presents a unique problem
for production machines. Each packet generated from EMANE is broadcast using multicast
across the network switching fabric. The volume of traffic generated grows quadratically with the
number of radios and will quickly saturate even a high bandwidth/low latency interconnect such
as InfiniBand or 10 Gb/s Ethernet. This traffic would severely impact the performance of any
other parallel application requiring use of the switching fabric, such as the Message Passing
Interface (MPI)-based computational mechanics applications. In addition, MNMI research
commonly involves investigations of MANET security concerns such as wormhole attacks,
denial of service, or steganography. Since a MANET emulation uses hardware in the loop
and generates actual network traffic, this research would be indistinguishable from a real attack
and could cause problems for analysts watching the production networks for suspicious
activities.
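
The growth of multicast traffic with the number of radios can be illustrated with a short Python sketch. It assumes, purely for illustration, that every emulated radio transmits at the same constant rate and that each packet is delivered to every other radio; the 100 kb/s rate below is a hypothetical value, not a measurement from Thufir or EMANE.

```python
def per_receiver_load_bps(n_radios: int, per_radio_rate_bps: int) -> int:
    """Traffic arriving at one radio when all other radios multicast at the same rate."""
    return (n_radios - 1) * per_radio_rate_bps


def total_delivered_bps(n_radios: int, per_radio_rate_bps: int) -> int:
    """Aggregate delivered traffic over all receivers: n(n-1) times the per-radio rate."""
    return n_radios * per_receiver_load_bps(n_radios, per_radio_rate_bps)


if __name__ == "__main__":
    rate_bps = 100_000  # hypothetical 100 kb/s offered load per radio
    for n in (100, 1000, 5000):
        print(f"{n:5d} radios: {per_receiver_load_bps(n, rate_bps) / 1e6:8.1f} Mb/s per receiver, "
              f"{total_delivered_bps(n, rate_bps) / 1e9:9.1f} Gb/s delivered in total")
```

Even at this modest assumed rate, each of 5000 receivers would see about 500 Mb/s, and the fabric would have to deliver about 2.5 Tb/s in aggregate.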

Running EMANE with any of these virtualization methods requires some advanced networking
configurations. This can be achieved through a few methods such as creating a tun/tap interface
or bridging an existing interface. A third method is available within EMANE when virtualization
is not used. However, this method requires opening a physical interface in promiscuous mode,
thus also requiring super-user access. Strict security postures in shared-resource computing
systems preclude their use for this type of work.

The preceding discussion provides the basis for the acquisition of Thufir and describes how the
system is used to enhance MANET simulation and emulation within ARL. Thufir provides a
very low latency 10 GigE network for over-the-air (OTA) communications between radios,
real-time RF propagation capabilities through the 456 Nvidia M2070 GPUs, a high speed and
capacity storage system from Panasas, and 6576 AMD Interlagos CPU cores for hosting virtual
machines (VMs) and running radio models. Other MANET support functions have also been
identified, such as large-scale data reduction and mining for support of experimentalists and
analysts. The following sections detail the installation, testing, and acceptance phases of Thufir's
integration into the MNMI.

3.1 System Specification

• 2.6 GHz 12c Interlagos AMD CPUs

o 160 compute 2S nodes, 2.67 GB memory/core
o 114 GPU 2S nodes, 2.67 GB memory/core
o 24 cores/node and 64 GB memory/node
o 6576 total compute cores
o 456 Nvidia M2070 GPUs, 4 GPUs per GPU node, 3 PCIe x16 bus

• 10 GigE cluster interconnect

o 10 Gb/s and 40 Gb/s with Gnodal switches
o 2-stage fat tree
o Switch port-to-port latency ~282 ns min, ~546 ns max
o Non-oversubscribed

• (2) 40 TB Panasas PAS 12 shelves, dedicated 1 GigE interconnect
o 2 Login and 2 management nodes, management network
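
The totals in this specification are mutually consistent, which can be checked with a short Python sketch. All input values are taken from the list above; the variable and function names are illustrative only.

```python
COMPUTE_NODES = 160       # CPU-only two-socket nodes
GPU_NODES = 114           # two-socket nodes with GPUs
CORES_PER_NODE = 24       # two 12-core processors per node
MEMORY_PER_NODE_GB = 64
GPUS_PER_GPU_NODE = 4


def summarize() -> dict:
    """Derive the system totals from the per-node figures."""
    nodes = COMPUTE_NODES + GPU_NODES
    return {
        "nodes": nodes,
        "cores": nodes * CORES_PER_NODE,
        "gpus": GPU_NODES * GPUS_PER_GPU_NODE,
        "mem_per_core_gb": MEMORY_PER_NODE_GB / CORES_PER_NODE,
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value}")
```

The 274 nodes give 6576 cores and 456 GPUs, and 64 GB shared by 24 cores gives approximately 2.67 GB per core, matching the listed values.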

3.2 Installation

The high density of the compute capabilities of GPGPUs also requires facilities with significant
power and cooling available. Facilities requirements for a system such as this include raised
floors for wiring and cooling with load limits, a reliable source of high-voltage electricity, and
significant air- or water-cooling capabilities. For instance, this system requires a maximum of
201 kW of power and 60 tons of cooling. In addition to the supply of power and cooling to this
system, the network connectivity is an important issue for remote usage. Remote users from
other facilities may require access, which means that these users need clearance checks on record
and signed user agreements. The power requirements of up to 201 kW, including power backup
using generators, require significant local facilities support. Fortunately, the APG DOD Shared
Resource Center site has significant facilities investments and expertise already installed and
available for use. The initial installation and power up of the system went smoothly and the
DHPI system was co-located with other HPC assets to leverage the currently available facilities.

3.3 Initial Acceptance Testing

As of this writing, in 2014, it is apparent that HPC systems for the foreseeable future are likely to
include some type of many-core solution such as the Intel MIC (many integrated core) or GPUs
such as the Nvidia Tesla coprocessors. Both of these technologies act as coprocessors and
therefore add another level of complexity in developing and testing application execution and
performance. The acceptance plan developed for this system focused on these nontraditional
technologies that are most likely to cause instabilities or have performance problems.


The initial acceptance test suite consisted of a series of tests to check performance and
capabilities of the system. These tests were mainly taken from the technology insertion test suite
used for acceptance testing of the distributed shared resource center acquisitions. Additional tests
included compute unified device architecture (CUDA), OpenCL, and performance of multicast
traffic on the network fabric. A 5000-node wireless MANET emulation using EMANE was used
to analyze system capabilities to simultaneously support 5000 VMs. The initial tests were
performed during the second half of February 2012. These tests were as follows:

1. Demonstrate that the nodes that are equipped with dual redundant power supplies will
continue to function in the event one is removed (Twin^2 compute nodes, Admin nodes,
login nodes).

2. Demonstrate that the system can be brought to an orderly halt while preserving the file
systems.

3. Demonstrate that the system can be brought back up after a full stop and reconnect to the
file system.

4. Demonstrate that files can be exchanged between the cluster and another HPC system.

5. Demonstrate that the system supports C, C++, and FORTRAN using the PGI Compiler
Suite.

6. Demonstrate that the batch job scheduler, in this case Grid Engine, is operational.

7. Demonstrate that the system supports C, C++, and FORTRAN using the Intel Compiler
Suite.

8. Demonstrate that the system supports OpenCL.

9. Demonstrate that the system supports CUDA.

10. Demonstrate that the system supports C and FORTRAN bindings for MPI over the
10 GigE network using OpenMPI or its equivalent.

11. Demonstrate the network latency from different locations within the cluster such as node to
node and VM to VM.

12. Demonstrate the network bandwidth.

13. Demonstrate the network multicast scalability.

14. Demonstrate that the system supports IPv4 and IPv6 dual stack functionality.

15. Demonstrate the Panasas features and performance including input/output bandwidth and
operation from a VM.

16. Demonstrate GPU performance (NBODY).

17. Demonstrate that the EMANE application can be successfully run on the cluster.

3.4 Stability Issues

Following base performance and capability testing, Thufir began effectiveness testing.
Effectiveness testing requires a 97% system availability during a 30-day period with full system
utilization, calculated using the following equation:

$$EL = \frac{\mathrm{OperationalUseTime\ (processor\ hours)}}{\mathrm{ScheduledUseTime\ (processor\ hours)}} \times 100$$

where EL is the effectiveness level, or system availability of the complete cluster. In addition to
the CPU cores, Thufir includes a number of GPUs that draw the lion’s share of power and
generate more heat than the rest of the devices in the system. During the run-up to effectiveness
testing, we identified a stability issue when running applications that attempted to simultaneously
use all of the GPUs, namely four, in a single-system node. Identifying the reason for the
instability was not trivial. A more detailed discussion and timeline of the vendor-supplied
support is given in section 3.6.
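
The effectiveness level defined above can be evaluated with a minimal Python sketch. The sketch assumes that scheduled time is counted over the 6576 CPU cores for the full 30 days; whether the actual acceptance accounting used exactly this basis is not stated here, and the function names are illustrative.

```python
def scheduled_processor_hours(cores: int, days: int) -> int:
    """Processor hours scheduled when all cores are scheduled around the clock."""
    return cores * days * 24


def effectiveness_level(operational_hours: float, scheduled_hours: float) -> float:
    """Effectiveness level (EL) as a percentage of scheduled processor hours."""
    if scheduled_hours <= 0:
        raise ValueError("scheduled_hours must be positive")
    return operational_hours / scheduled_hours * 100.0


if __name__ == "__main__":
    scheduled = scheduled_processor_hours(cores=6576, days=30)
    allowed_loss = scheduled * (1 - 0.97)
    print(f"Scheduled processor hours: {scheduled}")
    print(f"Maximum lost processor hours for EL >= 97%: {allowed_loss:.1f}")
    print(f"EL if 200000 processor hours are lost: "
          f"{effectiveness_level(scheduled - 200_000, scheduled):.2f}%")
```

Under these assumptions, about 4.73 million processor hours are scheduled, so roughly 142,000 processor hours may be lost over the 30 days before the 97% threshold is missed.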

The GPUs supported the execution of CUDA and OpenCL applications during the initial
acceptance testing. These tests included short computational kernels and ran serially across the
GPUs in the system. An OpenCL/MPI application that computes multibody interactions was
executed in parallel across all GPUs during effectiveness testing. Initially, the GPUs appeared to
fail randomly with the error “NVRM: Xid (0000:07:00): 48, An uncorrectable double bit error
has been detected on GPU (00 03 00)” followed by an error stating that the device “fell off the
bus.” After these errors occurred, the GPUs would be unusable and the system would require a
hard reboot in order to return to a usable state. The vendor eventually identified that the issue
was related to the 12-V power backplane printed circuit board (PCB) (figure 1).


Figure 1. (a) Image showing the trace on the power backplane PCB that needed to be severed, (b) Close-up view
of trace to be severed.

Prior to severing the PCB trace, the power backplane was only able to supply 45 A of current to
the system, resulting in the GPU instabilities observed. After the trace was severed, the power
backplane was able to provide the full 60-A current that the power rail was rated to provide.
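
For a sense of scale, the change in available current can be converted to power. The following worked equation assumes that the rail sits at its nominal 12 V and ignores any voltage droop under load, which is a simplification:

```latex
\begin{align*}
P_{\text{before}} &= 12\ \mathrm{V} \times 45\ \mathrm{A} = 540\ \mathrm{W} \\
P_{\text{after}}  &= 12\ \mathrm{V} \times 60\ \mathrm{A} = 720\ \mathrm{W} \\
\Delta P &= P_{\text{after}} - P_{\text{before}} = 180\ \mathrm{W} \quad (\text{an increase of about } 33\%)
\end{align*}
```

Under this assumption, the defect withheld about a quarter of the rated power on this rail.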

3.5 Performance Issues

Multiple performance issues initially plagued the DHPI system as it was spun up for
effectiveness testing. These were related to the Interlagos architecture, the MPI implementation,
and the 10 GigE network. Eventually, solutions to each of these issues were applied through
literature searches, software updates, and system configuration changes.

The first performance issue observed involved job scheduling on the Interlagos many-core
processors. Each of the compute and graphics nodes in Thufir contains two 12-core AMD
Opteron 6238 processors, in which pairs of cores form compute units that share cache and a
floating point unit. This means that if threads are appropriately allocated (one per compute
unit), the first 12 threads will have dedicated cache and floating point units. When utilizing 13
to 24 cores per node, performance is expected to decrease slightly. With the shared cache, cache
misses can be an issue. Initial testing showed the cache misses to increase significantly if more
than 12 cores were utilized, as shown in figures 2 and 3 for sysctl parameters and parallelization
strategy, respectively.


Figure 2. Cache misses vs. number of execution threads per node showing significant increase in
misses when more than 12 threads are executing simultaneously. Old and new refer to before and
after setting the sysctl parameter kernel.randomize_va_space from 2 (old) to 0 (new).

Figure 3. Plot of speedup vs. number of execution threads using pthreads, fork, and MPI.
Comparing the performance of the multithreaded applications vs. a tightly coupled MPI-based
SIMD application illustrates the effect of L1 instruction cache misses in the MPI library.


A white paper by AMD detailed an issue with instruction-cache cross-invalidations that were the
primary cause of these cache misses (6). At the hardware level, the first line of the level-1
instruction cache is invalidated when two cache lines with the same physical address are loaded.
This can occur when the same code is executing on both cores of a single compute unit in tight
loops. This is exactly what occurs in single instruction multiple data (SIMD) applications like
those we typically execute, e.g., finite element methods, finite difference, and molecular
dynamics.

A temporary solution is to turn off the address-space layout randomization (ASLR) using the
following command (6):

# sysctl -w kernel.randomize_va_space=0
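
Before and after applying this workaround, it can be useful to confirm the current setting. The following is a minimal Python sketch that reads the same kernel parameter through the standard Linux procfs path /proc/sys/kernel/randomize_va_space; the helper names are illustrative, and the sketch returns None on systems where the file is unavailable.

```python
from pathlib import Path

ASLR_PATH = Path("/proc/sys/kernel/randomize_va_space")

# Meaning of the values as documented for the Linux kernel sysctl.
ASLR_MODES = {
    0: "disabled",
    1: "partial (stack, mmap base, and VDSO randomized)",
    2: "full (partial plus heap randomization)",
}


def describe_aslr(raw: str) -> str:
    """Translate the raw sysctl value into a human-readable description."""
    value = int(raw.strip())
    return ASLR_MODES.get(value, f"unknown ({value})")


def current_aslr(path: Path = ASLR_PATH):
    """Return a description of the current ASLR mode, or None if it cannot be read."""
    try:
        return describe_aslr(path.read_text())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    print("ASLR:", current_aslr())
```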


After disabling the ASLR and re-running the benchmark tests, we observed a remarkable
improvement in performance, indicating that the scaling issue had been identified. Since
disabling the ASLR can decrease the system’s level of security, a permanent solution was
required. The recommended solution was to upgrade the Linux kernel to 3.2-rc1 or higher. We
achieved the final solution by updating the entire system from Red Hat Enterprise Linux 6.1 to
6.2 and implementing the kernel patch. Figure 4 shows the marked improvement in performance.

Figure 4. Comparison of application speedup using unpatched kernel with ASLR turned on
(gcc+intelmpi RH 6.1) and off (gcc+intelmpi RH 6.1 + sysctl mods) and the patched kernel
(gcc+intelmpi RH 6.2).

A small performance drop when using more than one core per compute unit can be observed in
figure 4, but application performance is greatly improved over the unpatched kernel. This
performance drop may be minimized in the future by further tuning for the Interlagos
architecture with the fused multiply add instruction and optimizations for cache usage. An
interesting discussion of the shared floating point unit can be found in the AMD Opteron 6200
Tuning guide (7):

“The floating point unit is capable of producing four double-precision FLOPS per
cycle per clock cycle simultaneously to each core in a pair for a total of eight per
core pair per cycle. This is comparable to the floating point performance per core per
cycle of an AMD Opteron™ 6100 CPU. But, unlike prior CPUs, when one core is
issuing fewer floating point instructions, the other core in the pair can use its four
FLOPS/cycle plus any unused by the other core to fully exploit the capacity of the
Flex FP. For example, in the extreme case of one core executing no floating point
instructions, the other core of the pair could achieve up to 8 double precision floating
point operations per cycle.”
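
The figures in this quotation translate directly into a theoretical double-precision peak. The Python sketch below assumes the nominal 2.6-GHz clock from the system specification, eight double-precision FLOPS per cycle per core pair as quoted, and no turbo or throttling; it covers the CPUs only, and the names are illustrative.

```python
CLOCK_HZ = 2.6e9                     # nominal clock from the system specification
DP_FLOPS_PER_CYCLE_PER_PAIR = 8      # from the tuning guide quotation
CORES_PER_NODE = 24
TOTAL_CORES = 6576


def peak_dp_flops(cores: int) -> float:
    """Theoretical double-precision peak for a given core count, in FLOPS."""
    core_pairs = cores // 2          # two cores share one Flex FP unit
    return core_pairs * DP_FLOPS_PER_CYCLE_PER_PAIR * CLOCK_HZ


if __name__ == "__main__":
    print(f"Per node:   {peak_dp_flops(CORES_PER_NODE) / 1e9:8.1f} GFLOPS")
    print(f"All CPUs:   {peak_dp_flops(TOTAL_CORES) / 1e12:8.2f} TFLOPS")
```

Because the floating point unit is shared, the per-node peak is the same whether 12 or 24 cores issue floating point instructions. This is consistent with the small drop, rather than a gain, seen beyond one core per compute unit for floating-point-bound codes.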

The nonuniform memory access (NUMA) configuration of the Thufir compute nodes is given
by the numactl command:

$ numactl --hardware

where a NUMA node is shown in figure 1a with eight cores as opposed to the six cores available
in the Thufir system. The node distance reported for adjacent NUMA nodes in the compute nodes
without GPUs is 16, or, to put this another way, the cost to access data located on an adjacent
node is about 60% greater than accessing local memory. As shown in the following example, this
cost increases to about 100%, increasing from a distance of 16 to 20 for graphics nodes that
contain GPUs.


[bjhenz@thufirg-0001 ~]$ numactl --hardware
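
The percentages quoted above follow from the convention that numactl reports node distances relative to a local distance of 10. A minimal Python sketch of the conversion is shown below; the function name is illustrative, and the distances are firmware-reported relative values rather than measured latencies.

```python
LOCAL_DISTANCE = 10  # distance that numactl reports for a node to itself


def relative_cost_increase(distance: int, local: int = LOCAL_DISTANCE) -> float:
    """Percentage by which a remote access is costlier than a local one."""
    return (distance - local) / local * 100.0


if __name__ == "__main__":
    for label, distance in (("compute node", 16), ("graphics node", 20)):
        print(f"{label}: distance {distance} -> "
              f"{relative_cost_increase(distance):.0f}% more than local")
```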
The network fabric of this system is based on 10 GigE, as opposed to other
high-performance network interconnects such as InfiniBand. This choice of interconnect is based on
the primary use of this system, namely emulation of mobile ad-hoc networks that require
Ethernet for the transport of emulated OTA network traffic. Secondary applications used on the
DHPI include GPU-based RF path loss calculations, network simulation, and other
physics-based simulations that use MPI for message passing. MPI over Ethernet typically involves the
Transmission Control Protocol (TCP)/IP stack, which increases overhead, including CPU
utilization, and yields poorer performance in general. For this reason, hardware vendors have
developed their own methods to reduce the operating system (OS) overhead.

A number of hardware vendors have developed remote direct memory access (RDMA) methods that bypass the OS stack and system drivers to improve latency and network throughput. Examples include iWarp from Intel, RoCE from Mellanox, and MX from Myricom. Thufir is configured with Mellanox ConnectX EN adapter boards for 10 GigE connectivity. RoCE is installed, but as of this writing, we have been unsuccessful in using RoCE over the Gnodal switches for simultaneous communications between more than two nodes. The use of MPI over TCP works well, but performance is expected to improve once the RDMA implementation is fully usable (5).
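
To give a feel for the per-message cost of the TCP path discussed above, the following sketch times a simple ping-pong exchange over a loopback TCP socket using only the Python standard library. It is not the benchmark used on Thufir and does not involve MPI or RDMA; the function names, message count, and message size are assumptions chosen for illustration, and loopback figures will differ from those measured across a 10 GigE switch.

```python
import socket
import threading
import time


def _recv_exact(sock: socket.socket, size: int) -> bytes:
    """Read exactly size bytes, since TCP may deliver a message in pieces."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def _echo_server(listener: socket.socket, n_messages: int, size: int) -> None:
    """Accept one connection and echo back n_messages fixed-size messages."""
    conn, _ = listener.accept()
    with conn:
        # Disable Nagle's algorithm so small messages are sent immediately.
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(n_messages):
            conn.sendall(_recv_exact(conn, size))


def tcp_pingpong_latency(n_messages: int = 200, size: int = 64) -> float:
    """Return the mean one-way latency in seconds over loopback TCP."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(
        target=_echo_server, args=(listener, n_messages, size)
    )
    server.start()

    payload = b"x" * size
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        start = time.perf_counter()
        for _ in range(n_messages):
            client.sendall(payload)
            _recv_exact(client, size)
        elapsed = time.perf_counter() - start

    server.join()
    listener.close()
    # Each iteration is one round trip, i.e. two one-way transfers.
    return elapsed / (2 * n_messages)


if __name__ == "__main__":
    print(f"mean one-way latency: {tcp_pingpong_latency() * 1e6:.1f} microseconds")
```

Every message in this exchange passes through the kernel TCP/IP stack on both send and receive, which is the overhead that RDMA methods are designed to bypass.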

3.6  Final  Acceptance  and  Entry  Into  Production

We discovered several important issues throughout final acceptance testing and system entry to production that might have been identified earlier through more thorough acceptance testing. Our focus throughout development of the acceptance test plan and acceptance testing has been on proper application execution and system interactions, with a secondary focus on achieving maximum system performance. After acceptance, we identified some of the performance issues discussed previously that have prevented 100% system utilization, including RDMA over Ethernet, compilers, operating systems, and general system tweaking. We recommend the possible inclusion of some stressful performance metrics during future acceptance testing. However, the inclusion of such metrics will increase the cost and time of system acceptance. It could potentially limit the amount of hardware that could be purchased and ultimately affect system performance because additional funds must be allocated to testing. This process has been a delicate balancing act between the budget, application requirements, and the amount of system support required.

Large-scale MANET emulations have been performed on the system during testing and initial production, including an emulation with 5000 wireless nodes. Applications available for the GPUs, including RF propagation path loss using the ITM and ray tracing algorithms, have executed successfully with expected performance achieved. The GPU applications are primarily independent of the network interconnects, and the MANET emulation uses TCP/IP for communications between nodes. Performance issues have been limited to secondary applications using the cluster that require high-performance MPI implementations. In addition to initially poor process thread placement on the CPU cores, the RDMA over Ethernet implementation installed on the system has failed to perform across more than two nodes simultaneously. These issues have either been solved through OS and kernel updates or, as in the case of RDMA over Ethernet, are being resolved through device vendor support.

4.  Conclusions


This  report  details  many  of  the  issues  encountered  during  installation  and  acceptance  of  the 
CPU/GPU  hybrid  cluster  Thufir  located  at  the  U.S.  Army  Research  Laboratory,  APG,  MD.  The 
uniqueness  of  this  system  lies  in  the  use  of  the  low  latency  10  GigE  network  fabric,  many-core 
(12)  AMD  Opteron  CPUs,  Nvidia  Fermi  GPUs,  and  a  mixture  of  CPU-only  and  CPU/GPU 
nodes.

List  of  Symbols,  Abbreviations,  and  Acronyms


ASLR    address-space layout randomization
APG     Aberdeen Proving Ground
ARL     U.S. Army Research Laboratory
CPU     central processing unit
CUDA    compute unified device architecture
DHPI    dedicated high performance computing project investment
DOD     Department of Defense
EMANE   extendable mobile ad-hoc network emulator
GigE    gigabit Ethernet
GPGPU   general purpose graphics processing unit
GPU     graphics processing unit
HPC     high performance computing
IP      Internet Protocol
ITM     Irregular Terrain Model
MANET   mobile ad-hoc network
MNMI    Mobile Network Modeling Institute
MPI     Message Passing Interface
NUMA    nonuniform memory access
OS      operating system
OTA     over the air
PCB     printed circuit board
RDMA    remote direct memory access
RF      radio frequency
SIMD    single instruction multiple data
SRW     Soldier Radio Waveform
TCP     Transmission Control Protocol
TI      technology insertion
VM      virtual machine
WNW     Wideband Networking Waveform